Dynamically reconfigurable interprocessor communication
network for SIMD multi-processors and

In a SIMD architecture having a two-dimensional
array of processing elements (10), where a controller 
broadcasts instructions to all processing elements 
(10) in the array, a dynamically reconfigurable 
switching means (14) useful to connect four of the 
processing elements (10) in the array into a group in 
accordance with either the broadcast instruction of 
the controller or a special communication instruction 
held in one processing element (10) of the group is 
provided. The switch (14) includes at least one data 
line (54) connected to each processing element (10) 
in the group. A multiplexer is connected to each data 
line (54) and to the controller and to a configuration 
register. It is adapted to load the special commu- 
nication instruction from the one processing element 
(10) in the group into a configuration register and to 
operate in accord with either the broadcast instruc- 
tion from the controller or the contents of the con- 
figuration register to select one of the four data lines 
(54) as a source of data and applying the data 
therefrom to a source output port. A demultiplexer is 
connected to each data line (54) and to the controller 
and to said configuration register, and to the source 
output port of the multiplexer means, and adapted to 
operate in accord with either the broadcast instruc- 
tion from the controller or the contents of the con- 
figuration register to select one of the four data lines 
(54) as a source of data and applying the data
therefrom to a selected data line (54). The switch
(14) also acts to connect processing elements (10) 
that cross chip partitions forming the processor ar- 
ray. Up to four such switches (14) can be used to 
connect a group of four processing elements (10). 




FIG. 2. 
BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates in general to computer 
architectures, and, more particularly, to a dynam- 
ically reconfigurable switching means useful in con- 
necting processing elements in processor arrays in 
SIMD multi-processor architectures. 

2. Description of the Related Art 

In processing arrays of processing elements 
found in SIMD multi-processor architectures, inter- 
processor communications connecting the eight 
nearest neighboring processing elements through 
four links per each processing element (PE) have 
been implemented in the X-net switch used in a 
commercially available machine (MasPar's MP-1), 
and in a university developed machine (BLITZEN 
at the University of North Carolina). However, none
of these designs provides the dynamic
reconfigurability of its switching units that is found
in the present invention.

In the literature, reconfigurable nearest neigh- 
bor communications in processing arrays have 
been described at a very high level in 
S.J. Tomboulian, A System for Routing Arbitrary
Communication Graphs on SIMD Architectures,
Ph.D. Dissertation, Duke University, 1986. However,
this description is given in a general and abstract 
fashion with no implementation details. 

SUMMARY OF THE INVENTION 

This invention describes a dynamically recon- 
figurable interprocessor communication network for 
dynamically controlling data flow between process- 
ing elements in processor arrays found in parallel 
Single Instruction stream Multiple Data stream 
(SIMD) computer architectures. Central to this
invention is the design of a Dynamically
Reconfigurable Switch (DRS) used to transfer data
between neighboring Processing Elements (PEs) in a
processing array. A number of important and novel
architectural features are attained through the use
of the present invention. These include requiring a
minimal number of interprocessor connections per
PE and allowing for local communication autonomy
at the level of each PE. The former is important
from a physical implementation point of view, and
the latter is significant for adding a new dimension
of flexibility to existing SIMD architectures.

Dynamic reconfigurability of the processing
elements in the processing array is achieved by
associating with each PE a distinct DRS embodying
the present invention. The configuration of the
DRS, i.e., the switch settings required to pass data
from an input port to an output port, can either be
set by the instructions received by all PEs
(broadcast instructions) from the controller, or be
independently set to a value stored locally in each
PE's memory area. Special consideration has been
given to the cascading issues that arise
when data is communicated between PEs
located in different physical chips
forming partitions of the processor array.

At least two primary features of this invention
set it apart from other more conventional
interprocessor communication schemes.

First, the design of the present invention is
very efficient in using interprocessor communication
links by exploiting certain communication
constraints. In this design, only four bidirectional
connections are required per PE to implement direct
eight nearest neighbor communications. Furthermore,
only a single bidirectional wire per data path
bit is required to communicate between corresponding
switches in neighboring chips or across
partition boundaries. Consequently, the number of
pins required for interprocessor communication
across chip boundaries is reduced, allowing for
more PEs to be implemented on a given chip.

The second significant feature of this design is
the added communication flexibility achieved from
dynamic reconfigurability of the DRS during
processing. Typically, in SIMD parallel processing
architectures, data movement between PEs is
accomplished in a homogeneous fashion (see Figures
5A and 5B): a pre-selected direction, e.g., North,
East, South, West, etc., is supplied with each
communication instruction. Through use of the DRS
design of the present invention, each PE can
independently set a specific direction for transfer of
data to, or from, its neighboring PEs. This ability to
dynamically reconfigure and direct data flow during
processing allows for construction of complex
computational paths which better match a specific
algorithm (see Figures 6A and 6B). For example, a
number of mapping methods have been developed
to take advantage of this dynamic
reconfigurability of data flow through a processing array
for efficiently processing neural network algorithms.
Similar dynamic mapping methods can potentially
be generalized to address problems in other fields.

In general, therefore, the present invention is
found to be embodied in a SIMD architecture having
a two dimensional array of processing elements,
where a controller broadcasts instructions to
all processing elements in the array, a dynamically
reconfigurable switching means useful to connect
four of the processing elements in the array into a
group in accordance with either the broadcast
instruction of the controller or a special
communication instruction held in one processing element of
the group, and would include at least one dataline,
being at least one bit wide, connected to each
processing element in the group. A multiplexer unit
is connected to each data line, the controller and to
a configuration register. It is adapted to load the
special communication instruction from the one
processing element in the group into a configuration
register and to operate in accord with either
the broadcast instruction from the controller or the
contents of the configuration register to select one
of the four data lines as a source of data and to
apply the data therefrom to a source output
port. Similarly, a demultiplexer unit is connected to
each data line, the controller and to the configuration
register, as well as to the source output port of
the multiplexer unit. The demultiplexer is adapted
to operate in accord with either the broadcast
instruction from the controller or the contents of the
configuration register to select one of the four data
lines and to apply the data from the source output
port of the multiplexer unit thereto.

The present invention is also embodied in a
SIMD architecture having a two dimensional array
of processing elements, with multiple chips
containing the processing elements of the array, where
a controller broadcasts instructions to all processing
elements in the array, a dynamically
reconfigurable switching means useful to connect four of
the processing elements in the array into a group
which may cross chip boundaries to form partitions
with each partition associated with one chip, to
direct data movement dynamically between selected
processing elements of the group in accordance
with either the broadcast instruction of the controller
or a special communication instruction held in
one processing element of the group. In this
partitioned situation, the switch would include, in each
partition, at least one dataline connected to each
processing element in the group in the partition. A
multiplexer unit is connected to each data line, the
controller and to a configuration register. It is
adapted to load the special communication instruction
from the one processing element in the group
into a configuration register and to operate in accord
with either the broadcast instruction from the
controller or the contents of the configuration register
to select one of the four data lines as a source
of data and to apply the data therefrom to a source
output port. A demultiplexer unit is connected to
each data line, the controller, the configuration
register, and to the source output port of the
multiplexer unit. The demultiplexer is adapted to
operate in accord with either the broadcast instruction
from the controller or the contents of the configuration
register to select one of the four data lines and
to apply the data from the source output port of
the multiplexer thereto. A dataline connects each
multiplexer in one partition to the demultiplexer in
the same partition, and a crossing dataline
connects each multiplexer in one partition to the
demultiplexer in each other partition.

The demultiplexer unit can also be adapted to 
operate in accord with either the broadcast instruc- 
tion from the controller or the contents of the 
configuration register to apply the data from the 
source output port of the multiplexer means to at 
least two of the data lines. 

In the partitioned situation, the dataline con- 
necting each multiplexer unit in one partition to the 
demultiplexer unit in the same partition and the 
crossing dataline connecting each multiplexer in 
one partition to the demultiplexer in each other 
partition may be a single dataline. 

Likewise, in both the partitioned and non-par- 
titioned situations described above, the present in- 
vention is also embodied in a dynamically recon- 
figurable switching means that further includes be- 
tween one and four of such switches in a configura- 
tion as shown in Figure 2A. The processing ele- 
ment has at least one input and one output register 
associated with its input/output data lines. Thus, in 
a given configuration using for example two switch- 
es per group, two simultaneous data transfers can 
be implemented from any pair of processing ele- 
ments in the group, to any other pair of processing 
elements in the same group. For this embodiment, 
each switch in the group has its own dedicated 
configuration register which can be loaded by any 
of the processing elements in its group. 

The description of the invention presented is 
intended as a general guideline for the design of 
the invention into a specific implementation. There- 
fore, implementation specific details of the design 
are left to be determined based on the implementa- 
tion technology and the allotted cost of the final 
product. In particular, the novel features of con- 
struction and operation of the invention will be 
more clearly apparent during the course of the 
following description, reference being had to the 
accompanying drawings wherein has been illus- 
trated a preferred form of the device of the inven- 
tion and wherein like characters of reference des- 
ignate like parts throughout the drawings. 

BRIEF DESCRIPTION OF THE FIGURES 

FIG. 1 is an idealized block schematic diagram 
illustrating the top level design of a computer 
architecture embodying the present invention; 
FIG. 2 is an idealized block schematic diagram 
illustrating a single processor and its associated 
DRS interprocessor communication switches; 
FIG. 2A is an idealized block schematic diagram 
illustrating a connection scheme for four proces- 
sors and associated four DRS interprocessor 
communication switches in an alternate configu- 
ration than that of FIG. 2; 
FIG. 3 is an idealized block schematic diagram
illustrating a reconfigurable interprocessor
communication switch;
FIG. 4 is an idealized block schematic diagram
illustrating the top level design of a processing
element;

FIG. 5A is an idealized block schematic diagram 
illustrating homogeneous data movement be- 
tween processing elements using the bypass 
mode of a processing array incorporating the 
present invention in a shift East; 
FIG. 5B is an idealized block schematic diagram 
illustrating homogeneous data movement be- 
tween processing elements using the bypass 
mode of a processing array incorporating the 
present invention in a shift South; 
FIG. 6A is an idealized block schematic diagram 
illustrating inhomogeneous data movement be- 
tween processing elements using different val- 
ues in the Conf_Reg of the present invention to 
implement a one dimensional ring structure on 
the processor array; 

FIG. 6B is an idealized block schematic diagram
illustrating inhomogeneous data movement between
processing elements using different values
in the Conf_Reg of the present invention to
implement arbitrary complex flow patterns on
the processor array with the numbers above the
linking arrows indicating the communication cycle
corresponding to the specific data movement;

FIG. 7 is an idealized block schematic diagram 
illustrating a single dynamically reconfigurable 
switch embodying the present invention and its 
associated processing elements in the proces- 
sor array; 

FIG. 8 is an idealized block schematic diagram 
illustrating a single dynamically reconfigurable 
switch embodying the present invention and its 
associated processing elements in the proces- 
sor array with East-West interprocessor commu- 
nications between neighboring processing ele- 
ments across chip boundaries; 
FIG. 9 is an idealized block schematic diagram 
illustrating a single dynamically reconfigurable 
switch embodying the present invention and its 
associated processing elements in the proces- 
sor array with North-South and East-West inter- 
processor communications between neighboring 
processing elements across chip boundaries; 
FIG. 10 is an idealized block schematic diagram 
illustrating a single dynamically reconfigurable 
switch embodying the present invention and its 
associated processing elements in the proces- 
sor array with North-South and East-West inter- 
processor communications between neighboring 
processing elements across chip corners; and, 



FIG. 11 is an idealized block schematic diagram
illustrating the interconnection of four chips
containing processing elements in the processing
array utilizing dynamically reconfigurable switches
embodying the present invention and
interprocessor communications between neighboring
processing elements across chip boundaries.

DETAILED DESCRIPTION OF THE PREFERRED
EMBODIMENT

With reference being made to the Figures, a
preferred embodiment of the present invention is
found in a computer architecture that can roughly
be classified as a Single Instruction stream Multiple
Data stream (SIMD) medium or fine grain parallel
computer. The top level architecture of such an
embodiment is depicted in Figure 1 where each
Processing Element 10 is arranged on a two
dimensional lattice 12 and is connected to eight of its
closest neighbors through four programmable
switches 14.

The architecture is most easily discussed in
three major units: the host computer 16, the
controller 18, and the Processor Array 20. Its memory
can be accessed by the host computer 16 through
the use of a high speed Direct Memory Access
(DMA) channel. The host computer 16 needs only
to specify the memory location of the data block to
be accessed in the memory space and the number
of words to be transferred.

The DMA controller 19 can transfer the data
without using any additional cycles from the host
computer 16. This feature permits a simple
programming interface between the host 16 and the
coprocessor. The host computer 16 is primarily
used for properly formatting the input data, long
term storage of data, and as a visual interface
between the user and the invention.

The controller unit 18 interfaces to both the
host computer 16 and to the Processor Array 20.
The controller 18 contains a microprogram memory
area 23 that can be accessed by the host 16. High
level programs can be written and compiled on the
host 16 and the generated control information can
be downloaded from the host 16 to the microprogram
memory 23 of the controller 18. The controller
18 broadcasts an instruction and a memory
address to the Processor Array 20 during each
processing cycle. The processors 10 in the Processor
Array 20 perform operations received from the
controller 18 based on a mask flag available in
each Processing Element 10.
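
The masked broadcast execution described above can be illustrated with a minimal Python sketch of one processing cycle. It is not part of the disclosed design: the names `PE` and `broadcast_step`, the opcodes `"load"` and `"add"`, the single accumulator, and the list used as local memory are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PE:
    """Hypothetical model of one processing element."""
    memory: list          # local column of memory
    mask: bool = True     # mask flag: act on the broadcast instruction only if set
    acc: int = 0          # one accumulator register (assumed for the example)


def broadcast_step(pes, opcode, address):
    """Apply one broadcast (opcode, address) pair to every unmasked PE."""
    for pe in pes:
        if not pe.mask:
            continue                      # masked-off PEs ignore the instruction
        operand = pe.memory[address]      # same address, different local data
        if opcode == "load":
            pe.acc = operand
        elif opcode == "add":
            pe.acc += operand
        else:
            raise ValueError(f"unknown opcode: {opcode}")


pes = [PE(memory=[1, 10]), PE(memory=[2, 20], mask=False), PE(memory=[3, 30])]
broadcast_step(pes, "load", 0)
broadcast_step(pes, "add", 1)
print([pe.acc for pe in pes])
```

Every PE sees the same instruction and address, yet the second PE keeps its initial accumulator value because its mask flag is cleared.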

The Processor Array unit 20 contains all the
processing elements 10 and the supporting
interconnection switches 14. Each Processing Element
10 in the Processor Array 20 has direct access to
its local column of memory within the architecture's
memory space 22. Due to this distributed memory
organization, memory conflicts are eliminated,
which consequently simplifies both the hardware
and the software designs.

In this architecture, the Processing Element 10 
makes up the computational engine of the system. 
As mentioned above, the Processing Elements 10
are part of the Processor Array 20 subsystem and
all receive the same instruction stream, but perform
the required operations on their own local data
stream. Each Processing Element 10 is comprised
of a number of Functional Units 24, a small register 
file 26, interprocessor communication ports 28, and 
a mask flag 30 as illustrated in FIG. 4. 

The functional units 24 in each Processing
Element 10 include an adder, a multiplier, and a
shift/logic unit. Depending on the specific implementation,
additional functional units 24 can be added to the 
design. Similar to many RISC type processors, 
each Processing Element 10 has an internal data 
bus 32 to transfer data between various units. For 
example, data could move from a register in the 
register file 26 to one of the operand registers of 
the adder unit, or data can be transferred from the 
output register of the multiplier to the I/O output 
port 28. A mask bit is used to enable/disable the 
functional units 24 from performing the operation 
instructed by the controller 18. 

Each Processing Element 10 communicates 
with its neighbors through the I/O ports 28. Only
one input and one output port 28 is required in 
each Processing Element 10 since during each 
systolic cycle only a single data value is received 
and transmitted by a Processing Element 10. In 
configurations utilizing more than one switch, such 
as the one shown in FIG. 2A, multiple Input/Output 
ports can be used for simultaneous data transfer 
between members of a group. The output of each 
of the I/O registers is connected to the four switch- 
es 14 surrounding each Processing Element 10. 
The selection of the destination Processing Ele- 
ment 10 for an outgoing data value and the source 
Processing Element 10 for an incoming data value 
is made by the specific switch settings of the 
switches 14. 

Each processor 10 in the architecture has 
read/write access to its own local memory area 38 
as shown in FIG. 4. The memory can be kept off
chip in order to allow for simple memory expansion.
During each processing cycle, the memory
location associated with each instruction is broadcast
by the controller unit 18 to all the Processing
Elements 10. In this fashion, all the processors
10 can access a single plane of memory at each
time step. The memory access speed is matched
with the computation rate of each processor 10 so
that each memory access can be completely
overlapped with computation. This feature allows for
efficient systolic processing.

Each word in the Processing Element's local
memory area 38 is preferably comprised of two
distinct fields 40, 42. The first 40 is the data field
used to store and retrieve the actual data associated
with the computation, such as neuron activation
values, synaptic weight values, etc. The second
field 42 holds three bits which indicate the
switch setting for the switch 14 associated with
each Processing Element 10. The three configuration
bits can be decoded to select one of the eight
configurations shown in FIG. 7. These switch setting
values are determined prior to the start of
computation and are preloaded by the host 16. A
new switch setting value can be read during each
instruction cycle.
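
The two-field memory word described above can be sketched as a simple bit-packing scheme. The text fixes only the three-bit width of the switch-setting field 42; the 16-bit data width, the placement of the configuration bits above the data bits, and the names `pack_word` and `unpack_word` are assumptions chosen for illustration.

```python
DATA_BITS = 16   # assumed width of data field 40; not specified in the text
CONF_BITS = 3    # the three switch-setting bits of field 42


def pack_word(data, conf):
    """Combine a data value and a 3-bit switch setting into one memory word."""
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data does not fit in the data field")
    if not 0 <= conf < (1 << CONF_BITS):
        raise ValueError("switch setting must fit in three bits")
    return (conf << DATA_BITS) | data


def unpack_word(word):
    """Split a memory word back into (data, conf)."""
    data = word & ((1 << DATA_BITS) - 1)
    conf = (word >> DATA_BITS) & ((1 << CONF_BITS) - 1)
    return data, conf


word = pack_word(data=0x1234, conf=5)
print(unpack_word(word))
```

Because both fields live in the same word, a single memory read per instruction cycle would deliver the operand and the switch setting together, which matches the statement that a new setting can be read each cycle.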

Nearest neighbor communications in the Processor
Array 20 are performed through dynamically
reconfigurable switches 14 connecting one Processing
Element 10 to three of its eight nearest
neighbors. There are four switches 14 connected to
each Processing Element 10, see Figure 2.

An alternate configuration for nearest neighbor
communications in the Processor Array 20 is
shown in Figure 2A where up to four dynamically
reconfigurable switches 14 are used in each group
of four processing elements. In this configuration,
in both the partitioned and non-partitioned
situations described above, each processing element
10 has at least one input and one output register
associated with its input/output data lines. Thus, in
a given configuration using for example two switches
per group, two simultaneous data transfers can
be implemented from any pair of processing elements
in the group, to any other pair of processing
elements in the same group. For this embodiment,
each switch in the group has its own dedicated
configuration register which can be loaded by any
of the processing elements in its group.

Referring once again to the configuration in
Figure 2, dynamically reconfigurable switch 14
allows for communication between four nearest-neighbor
Processing Elements 10 while requiring
as few as one I/O connection 54 to each Processing
Element 10. The communication bandwidth between
adjacent Processing Elements 10 is kept
equal to the memory access bandwidth to assure
efficient systolic processing. One unique feature of
the present architecture is to allow each switch 14
in the interconnection network to be distinctly
configured.

The switch settings are stored in the local
memory 38 of each switch 14 and can be accessed
at the beginning of each processing cycle. The
switch memory address is supplied by the controller
18 at each cycle. This design allows for a
dynamically changing flow pattern of data through
the processing array 20. In other two-dimensional
connected parallel SIMD architectures, all the
Processing Elements 10 perform the same communication
operation. In other words, the instruction
word broadcast by the controller 18 specifies
which direction the data is to be moved in the
processing array 20, e.g., North to South, East to
West, West to South, etc. In the presently described
architecture, on the other hand, one Processing
Element 10 can receive data from its North
Neighbor while another is receiving data from its
West Neighbor, depending on the local switch settings.
An example implementation of this switch 14
is shown in Figure 3 using a multiplexer 48 and
demultiplexer 50 to select one of four inputs and
route the data to one of four outputs 54. This
flexibility in interprocessor communication is essential
in efficiently implementing neural networks with
sparse or block structured interconnections.
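
The multiplexer/demultiplexer arrangement of Figure 3 can be modeled, very roughly, as a select-then-route step. The sketch below is an assumption-laden illustration rather than the disclosed circuit: the function name `route`, the port indices 0 to 3, and the use of `None` for lines that receive nothing are all hypothetical.

```python
PORTS = 4   # one data line 54 per processing element in the group


def route(lines_in, src, dst):
    """Model one switch cycle.

    lines_in: values driven onto the four data lines by the four PEs
    src:      index selected by the multiplexer 48 (0..3)
    dst:      index selected by the demultiplexer 50 (0..3)
    Returns the value delivered on each line; lines other than the
    destination are reported as None (an assumption of this sketch).
    """
    if len(lines_in) != PORTS:
        raise ValueError("expected four data lines")
    if src not in range(PORTS) or dst not in range(PORTS):
        raise ValueError("port index out of range")
    source_output = lines_in[src]      # multiplexer picks the source line
    lines_out = [None] * PORTS
    lines_out[dst] = source_output     # demultiplexer drives the destination line
    return lines_out


print(route([10, 20, 30, 40], src=2, dst=0))
```

Giving each switch its own `src` and `dst` values is what would let one PE receive from its North neighbor while another receives from its West neighbor in the same cycle.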

A discussion of the operation of this invention 
in an architectural implementation of nearest neigh- 
bor communications in an array grid of Processing 
Elements (PEs) is now given. 

The interprocessor communications between 
processing elements in the processing array are 
performed through Dynamically Reconfigurable 
Switches (DRS) connecting one Processing Ele- 
ment to three of its eight nearest neighbors. There 
are four Processing Elements connected to each 
DRS cell. 

A configuration register within the DRS controls 
the selection of source and destination ports for the 
data flowing through the switch. The content of this 
configuration register can be set by only one of the 
processing elements connected to the switch, e.g., 
the upper left Processing Element of the DRS. Due 
to the extra overhead associated with loading new 
configuration values into the configuration register, 
a bypass mode may be implemented so that the
data movement direction is supplied by the specific 
communication instruction. When using the bypass 
mode, data movement in the processor array is 
homogeneous, like that of conventional SIMD par- 
allel processors. In this mode the instruction speci- 
fies the direction of data movement for all the 
Processing Elements in the system. See Figures 
5A and 5B. 
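
Homogeneous data movement in bypass mode can be pictured as shifting an entire grid of values one step in the broadcast direction, in the spirit of Figures 5A and 5B. The following Python sketch assumes a small rectangular array, discards values shifted past the edge, and fills vacated positions with `None`; the edge behavior and the name `shift` are assumptions, since the text does not specify them.

```python
# (row, column) offsets for the four directions named in the text.
OFFSETS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}


def shift(grid, direction, fill=None):
    """Move every value one step in the same broadcast direction."""
    rows, cols = len(grid), len(grid[0])
    dr, dc = OFFSETS[direction]
    out = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                out[nr][nc] = grid[r][c]   # value leaves (r, c), arrives at neighbor
    return out


grid = [[1, 2, 3],
        [4, 5, 6]]
print(shift(grid, "E"))
print(shift(grid, "S"))
```

A single direction argument drives every element, which is the defining property of the bypass mode: no per-switch configuration has to be loaded.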

On the other hand, by using the configuration 
register, inhomogeneous communication patterns 
can be formed on the processing array. A unique 
feature of this invention is to allow each switch in 
the interconnection network to be distinctly config- 
ured. The configuration register can be configured 
once at the start of processing to implement a 
specific communication pattern, such as construct- 
ing a one dimensional computational ring on the 
array as shown in Figure 6A without ever changing 
during the computation. Alternatively, it can
be dynamically reconfigured to form complex flow
patterns on the processing array for each
communication step as shown in Figure 6B.
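
To illustrate how locally stored directions can embed a one dimensional ring in the two dimensional array, the sketch below follows a hand-written direction map on a 4 x 4 array. The map `RING_4X4` is a hypothetical example and is not taken from Figure 6A; the function name `follow` is likewise an assumption.

```python
OFFSETS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

# One outgoing direction per PE, as it might be held in local configuration.
RING_4X4 = [
    "EEES",
    "NSWW",
    "NEES",
    "NWWW",
]


def follow(directions, start=(0, 0)):
    """Follow the locally configured directions until the start PE recurs."""
    path = [start]
    r, c = start
    while True:
        dr, dc = OFFSETS[directions[r][c]]
        r, c = r + dr, c + dc
        if (r, c) == start:
            return path
        if (r, c) in path:
            raise ValueError("directions do not form a single ring")
        path.append((r, c))


path = follow(RING_4X4)
print(len(path))   # number of PEs visited before returning to the start
```

Each PE forwards data in its own direction, so the ring needs no change to the physical wiring; only the stored switch settings differ from PE to PE.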

The 3-bit Conf_Reg specifies one of eight
directions the data is to be moved, e.g., South to
North, East to West, West to South (see Figure 7).
For a given configuration mode, a multiplexer and
demultiplexer are used to select one of four inputs
and route the data to one of four outputs as listed
in Figure 7. Associated with each Conf_Reg there
is additional decoding hardware used to generate
the appropriate control signals. For the sake of
clarity, these details have been omitted in all the
drawings. Each PE is identified in all drawings by
its row and column coordinate (r.c). For example,
PE 1.1 represents the PE located at row one and
column one in the Processor Array (PA). The DRS
bus width is not explicitly defined and can vary
from 1 to m bits.
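
The decoding of the 3-bit Conf_Reg into a source and a destination can be sketched as a table lookup. The actual assignment of codes is listed in Figure 7, which is not reproduced in the text, so the table below is purely hypothetical: only the pairs South to North, North to South, East to West and West to South are named in the description, and the remaining entries and all code values are placeholders.

```python
# Hypothetical encoding of (source, destination) pairs; the real table is in Figure 7.
CONF_TABLE = {
    0b000: ("S", "N"),
    0b001: ("N", "S"),
    0b010: ("E", "W"),
    0b011: ("W", "E"),
    0b100: ("W", "S"),
    0b101: ("S", "W"),
    0b110: ("N", "E"),
    0b111: ("E", "N"),
}


def decode_conf_reg(value):
    """Return the (source, destination) pair selected by a 3-bit Conf_Reg value."""
    if not 0 <= value <= 0b111:
        raise ValueError("Conf_Reg holds exactly three bits")
    return CONF_TABLE[value]


for code in range(8):
    src, dst = decode_conf_reg(code)
    print(f"{code:03b}: {src} to {dst}")
```

In hardware the same mapping would be produced by the decoding logic associated with each Conf_Reg, generating the select signals for the multiplexer and demultiplexer.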

If the required Processor Array is larger than a
single chip (e.g., each chip contains n x n PEs),
then the DRS cells of the PEs located at the chip
boundaries need to be partitioned among multiple
chips. An example partition of this switch is shown
in Figure 8. A single bidirectional wire per bus bit is
used to connect neighboring PEs in the west side
to corresponding neighbors in the east side. In
addition to r.c coordinates within a chip, each PE
is identified by its chip number. For example, PE
C3_1.n represents the PE located on row 1 and
column n in chip 3. Similarly, PE C3_1.n_out and
PE C3_1.n_in represent the output channel and input
channel, respectively, of this PE. Since the switch
shown is controlled by PE C3_1.n on the west
side, the configuration information needs to be
transmitted across the chip boundary to the East
half of the switch before the data communication
channel can be established. Therefore, two
Conf_Reg are required, one for the West half and
one for the East half. The extra copy of the
Conf_Reg on the East half is needed to prevent
both the West-side mux and the East-side mux from
driving the same bus simultaneously.

There is one wire per data bus bit connecting
the East side of chip 3 pin #1 (i.e., C3_E1) to
the West side of chip 4 pin #1 (i.e., C4_W1). The
data channel and configuration channel share this
same bus wire connecting two neighboring chips;
the instruction to load the Conf_Reg will
pre-configure the West half of the switch cell to select
PE C3_1.n_out so that the extra copy of
Conf_Reg on the East half of the switch can be
initialized. For a bus width of 1 bit, this instruction
takes four cycles to execute, i.e., one cycle to read
the configuration mode from the memory of PE
C3_1.n into the Conf_Reg of the West half of the
DRS and 3 cycles to load this information into
Conf_Reg of the East half of the DRS. This could
be a large overhead for those operations which do
not require each switch in the interconnection network
to be distinctly configured. Therefore, the
switch cell will be allowed to bypass the Conf_Reg
and take the configuration mode information directly
from the instruction field, as explained earlier.
Figure 9 shows the North-South partition of the
DRS. There is one wire connecting the South side
of chip 2 pin #1 (i.e., C2_S1) to the North side of
chip 4 pin #1 (i.e., C4_N1). The procedure for
loading the configuration value into the North and South
halves of the DRS is similar to the one described
for the West-East direction.
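
The four-cycle cost quoted above for a 1-bit bus can be written as a small formula: one cycle for the local read plus the cycles needed to push three configuration bits across the boundary. The Python sketch below generalizes this to wider buses under the assumption that the remote Conf_Reg is filled at one bus-width of bits per cycle; that scaling, and the name `conf_load_cycles`, are not stated in the text.

```python
import math

CONF_BITS = 3   # width of Conf_Reg


def conf_load_cycles(bus_width, crosses_boundary=True):
    """Cycles to load Conf_Reg, generalizing the 1-bit example in the text.

    One cycle reads the configuration from PE memory into the local
    Conf_Reg. If the switch is split across chips, the remote copy is
    then filled bus_width bits per cycle (an assumed scaling).
    """
    if bus_width < 1:
        raise ValueError("bus width must be at least one bit")
    cycles = 1
    if crosses_boundary:
        cycles += math.ceil(CONF_BITS / bus_width)
    return cycles


for width in (1, 2, 3, 8):
    print(width, conf_load_cycles(width))
```

Under this assumption the overhead never drops below two cycles for a partitioned switch, which is consistent with the motivation given for the bypass mode.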

Figure 10 shows corner partition of the DRS, 
which spreads over four adjacent corners of four 
chips. There is one copy of Conf_Reg on each 
corner of the switch. The load Conf_Reg instruc- 
tion will pre-configure the upper-left corner DRS 
cell to select PE C1_n.n_out, so that all three 
extra copies of Conf_Reg on each corner of the 
switch can be initialized. There is one wire connecting
the South side of chip 1 pin #n (i.e.,
C1_Sn) to the North side of chip 3 pin #n (i.e.,
C3_Nn). One extra wire connects the south side
of chip 2 pin #0 (i.e., C2_S0) to the north side of 
chip 4 pin #0 (i.e., C4_N0) so that chip 1 and chip 
4 can have diagonal connection between them. 
Therefore, there are (n + 1) wires (i.e., S0_N0 to 
Sn_Nn) in the North-South direction vs. n wires 
(i.e., E1_W1 to En_Wn) in the East-West direc- 
tion to support diagonal connections. This fact is 
dictated by the corner partition of the switch. In
other words, referring to both Figure 9 and Figure 11, it
is apparent that pins C2_S1 to C2_Sn are assigned
to PE C2_n.1 to PE C2_n.n in chip 2,
while pin C2_S0 in chip 2 is assigned to PE 
C1_n.n located at chip 1 to support its diagonal 
connection to chip 4, as shown in Figure 10. 

Up to four switches, each comprising a multiplexer,
a demultiplexer, and a configuration register,
can be used at each switching point in the
interconnection network. When four such switches
are used, an electronic crossbar communications 
switch is formed between the four processing ele- 
ments in the group, allowing for simultaneous com- 
munications between all four processing elements 
in the group. 
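
The crossbar behaviour obtained with four switches can be sketched as follows. This is a simplified model under stated assumptions, not the disclosed hardware: the function name `crossbar` and the representation of each switch as one source-to-destination entry in a dictionary are hypothetical, and the model does not cover one source driving several destinations.

```python
from typing import Dict, List, Optional


def crossbar(lines: List[int], routes: Dict[int, int]) -> List[Optional[int]]:
    """Model up to four mux/demux switches acting together on four data lines.

    `routes` maps a source line index to a destination line index; each
    entry stands for one switch. When the destinations are distinct, all
    transfers take place in the same step.
    """
    if len(lines) != 4:
        raise ValueError("a group has exactly four data lines")
    if len(routes) > 4:
        raise ValueError("at most four switches per switching point")
    for src, dst in routes.items():
        if not (0 <= src < 4 and 0 <= dst < 4):
            raise ValueError("line indices must be in the range 0..3")
    if len(set(routes.values())) != len(routes):
        raise ValueError("two switches may not drive the same data line")
    out: List[Optional[int]] = [None] * 4
    for src, dst in routes.items():
        out[dst] = lines[src]  # one switch: mux selects src, demux drives dst
    return out
```

With `routes = {0: 1, 1: 2, 2: 3, 3: 0}`, every processing element sends and receives one value in a single step, which a single switch could not achieve.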

The invention described above is, of course, 
susceptible to many variations, modifications and 
changes, all of which are within the skill of the art. 
It should be understood that all such variations, 
modifications and changes are within the spirit and 
scope of the invention and of the appended claims. 
Similarly, it will be understood that Applicant in- 
tends to cover and claim all changes, modifications 
and variations of the example of the preferred 
embodiment of the invention herein disclosed for 
the purpose of illustration which do not constitute 
departures from the spirit and scope of the present
invention.

Claims

1. A dynamically reconfigurable switching means
(14) in a SIMD architecture having a two dimensional
array (20) of processing elements (10), where a
controller (18) broadcasts instructions to all
processing elements (10) in the array (20), the
switch (14) being useful to connect four of the
processing elements (10) in the array (20) into a
group in accordance with either the broadcast
instruction of the controller (18) or a special
communication instruction held in one processing
element (10) of the group, the switch (14) comprising:

at least one data line (54) connected to
each of the processing elements (10) in the
group;

a multiplexer means (48) connected to
each data line (54) and to the controller (18)
and to a configuration register, adapted to load
the special communication instruction from the
one processing element (10) in the group into
a configuration register and to operate in accord
with either the broadcast instruction from
the controller (18) or the contents of the
configuration register to select one of the four data
lines (54) as a source of data and applying the
data therefrom to a source output port;

a demultiplexer means (50) connected to
each data line (54) and to the controller (18)
and to said configuration register, and to the
source output port of the multiplexer means
(48), and adapted to operate in accord with
either the broadcast instruction from the
controller (18) or the contents of the configuration
register to select one of the four data lines (54)
and applying the data from the source output
port of the multiplexer means (48) thereto.

2. The dynamically reconfigurable switching
means of claim 1 further characterized by at
least one multiple bit width data line connected
to a corresponding processing element (10),
said processing element (10) having at least
one input/output register (28) and at least one
configuration register associated with selected
bits of said at least one multiple bit width data
line.

3. The dynamically reconfigurable switching
means of claim 1 or claim 2, characterized by
at least one copy of the switch (14) including
said multiplexer means (48), said demultiplexer
means (50) and said configuration register
between each group of four processing elements
(10).

4. The dynamically reconfigurable switching
means of any of claims 1 - 3, further characterized
by no more than four copies of the switch
(14) including said multiplexer means (48), said
demultiplexer means (50) and said configuration
register between each group of four
processing elements (10).

5. The dynamically reconfigurable switching
means of any of claims 1 - 4, characterized in
that said demultiplexer means (50) is further
adapted to operate in accord with either the
broadcast instruction from the controller (18) or
the contents of the configuration register to
apply the data from the source output port of
the multiplexer means (48) to at least two of
the data lines (54) in the group of four data
lines (54).

6. A dynamically reconfigurable switching means
in a SIMD architecture having a two dimensional
array (20) of processing elements (10),
with multiple chips containing the processing
elements (10) of the array (20), where a
controller (18) broadcasts instructions to all
processing elements (10) in the array (20), the
switch being useful to connect four of the
processing elements (10) in the array (20) into
a group which may cross chip boundaries to
form partitions with each partition associated
with one chip, to direct data movement
dynamically between selected processing elements
(10) of the group in accordance with either the
broadcast instruction of the controller (18) or a
special communication instruction held in one
processing element (10) of the group, the
switch (14) comprising in each partition:

at least one data line (54) connected to
each processing element (10) in the group in
the partition;

a multiplexer means (48) connected to
each data line (54) and to the controller (18)
and to a configuration register, adapted to load
the special communication instruction from one
processing element (10) in the group into a
configuration register and to operate in accord
with either the broadcast instruction from the
controller (18) or the contents of the configuration
register to select one or none of the data
lines (54) in the partition as a source of data
and applying the data therefrom to a source
output port;

a demultiplexer means (50) connected to
each data line (54) and to the controller (18)
and to said configuration register, and to the
source output port of the multiplexer means
(48), and adapted to operate in accord with
either the broadcast instruction from the
controller (18) or the contents of the configuration
register to apply the data from the source
output port of the multiplexer means (48) to a
selected one or none of the data lines (54) in
the partition; and,

a data line (54) connecting each multiplexer
(48) in one partition to the demultiplexer
(50) in the same partition, and a crossing data
line (54) connecting each multiplexer (48) in
one partition to the demultiplexer (50) in each
other partition.

7. The dynamically reconfigurable switching 
means of claim 6, characterized in that said 
data line (54) connecting each multiplexer 
means (48) in one partition to the demul- 
tiplexer means (50) in the same partition and 
said crossing data line (54) connecting each 
multiplexer (48) in one partition to the demul- 
tiplexer (50) in each other partition is a single 
data line (54). 

8. The dynamically reconfigurable switching 
means of claim 6 or claim 7, characterized by 
at least one multiple bit width data line con- 
nected to a corresponding processing element 
(10), said processing element (10) having at 
least one input/output register (28) and at least 
one configuration register associated with se- 
lected bits of said at least one multiple bit 
width data line. 

9. The dynamically reconfigurable switching 
means of any of claims 6 - 8, characterized in 
that said demultiplexer means (50) is further 
adapted to operate in accord with either the 
broadcast instruction from the controller (18) or 
the contents of the configuration register to 
apply the data from the source output port of 
the multiplexer means (48) to at least two of 
the data lines (54) in the partition. 

10. The dynamically reconfigurable switching 
means of any of claims 6 - 9, characterized by 
at least one copy of the switch (14) including 
said multiplexer means (48), said demultiplexer 
means (50) and said configuration register be- 
tween each group of four processing elements 
(10). 

11. The dynamically reconfigurable switching
means of any of claims 6 - 10, characterized
by no more than four copies of the switch (14)
including said multiplexer means (48), said
demultiplexer means (50) and said configuration
register between each group of four processing
elements (10).

12. The dynamically reconfigurable switching
means of any of claims 6 - 11, characterized in
that said multiplexer means (48) is further
adapted to load the special communication
instruction from one processing element (10) in
the group into a configuration register and to
operate in accord with either the broadcast
instruction from the controller (18) or the
contents of the configuration register to select one
or none of the data lines (54) in the partition as
a source of data and applying the data therefrom
to a source output port selected in one of
the partitions.

13. The dynamically reconfigurable switching
means of any of claims 6 - 12, characterized in
that said demultiplexer means (50) is further
adapted to operate in accord with either the
broadcast instruction from the controller (18) or
the contents of the configuration register to
select only one source output port from all the
partitions and to apply the data from the
selected source output port to a selected one or
none of the data lines (54) in its partition.